Review Of Twenty Nonparametric Statistics And Their Large Sample Approximations



Nonparametric procedures are often more powerful than classical tests for real 
world data which are rarely normally distributed. However, there are difficulties 
in using these tests. Computational formulas are scattered throughout the 
literature, and there is a lack of availability of tables of critical values. We bring 
together the computational formulas for twenty commonly employed 
nonparametric tests that have large-sample approximations for the critical value. 
Because there is no generally agreed upon lower limit for the sample size, we use 
Monte Carlo methods to determine the smallest sample size that can be used with 
the large-sample approximations. The statistics reviewed include single- 
population tests, comparisons of two populations, comparisons of several 
populations, and tests of association. 




Classical parametric tests, such as the F and t, were developed in the early part of the twentieth 
century. These statistics require the assumption of population normality. Bradley (1968) wrote, “To the 
layman unable to follow the derivation but ambitious enough to read the words, it sounded as if the 
mathematician had esoteric mathematical reasons for believing in at least quasi-universal quasi- 
normality” (p. 8). “Indeed, in some quarters the normal distribution seems to have been regarded as 
embodying metaphysical and awe-inspiring properties suggestive of Divine Intervention” (p. 5). 

However, when Micceri (1989) investigated 440 large-sample education and psychology data 
sets, he concluded “No distributions among those investigated passed all tests of normality, and very 
few seem to be even reasonably close approximations to the Gaussian” (p. 161). This is of great 
practical importance because even though the well known Student’s t test is preferable to nonparametric 
competitors when the normality assumption has been met, Blair and Higgins (1980) noted: 

Generally unrecognized, or at least not made apparent to the reader, is the fact that the t 
test’s claim to power superiority rests on certain optimal power properties that are 
obtained under normal theory. Thus, when the shape of the sampled population(s) is 
unspecified, there are no mathematical or statistical imperatives to ensure the power 
superiority of this statistic. (p. 311)

Blair and Higgins (1980) demonstrated the power superiority of the nonparametric Wilcoxon 
Rank Sum test over the t test for a variety of nonnormal theoretical distributions. In a Monte Carlo study 
of Micceri’s real world data sets, Sawilowsky and Blair (1992) concluded that although the t test is 
generally robust with respect to Type I errors under conditions of equal sample size, fairly large 
samples, and two-tailed tests, it is not powerful for skewed distributions. Under these conditions, the 
Wilcoxon Rank Sum test is three to four times more powerful. See also Bridge and Sawilowsky (1999)
and Nanna and Sawilowsky (1998). 

It is appropriate to consider further this class of statistics because of the power advantages of 
nonparametric tests with real world data. The terms ‘nonparametric’ and ‘distribution-free’ are often 
used interchangeably to describe tests that make few, if any, assumptions about the distribution of the 
population. There is, however, a distinction between them. Bradley (1968) explained that “a
nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical
density function, whereas a distribution-free test is one which makes no assumptions about the precise
form of the sampled population” (p. 15). In this paper we are concerned with nonparametric procedures.

A difficulty in using nonparametric tests is the limited availability of computational formulas and tables
of critical values. For example, Siegel and Castellan (1988) noted, “Valuable as these sources are, they
have typically either been highly selective in the techniques presented or have not included the tables of
significance” (p. xvi). This continues to be a problem as evidenced by our survey of 20 in-print generic
college statistics textbooks, including seven general textbooks, eight for the social and behavioral 
sciences, four for business, and one for engineering. Formulas were given for only eight nonparametric 
statistics, and tables of critical values were given for only the following six: (a) Kolmogorov-Smirnov
test, (b) Sign test, (c) Wilcoxon Signed Rank test, (d) Wilcoxon (Mann-Whitney) test, (e) Spearman’s
rank correlation coefficient, and (f) Kendall’s rank correlation coefficient.

This situation is somewhat improved for nonparametric statistics textbooks. Eighteen 
nonparametric textbooks published since 1956 were also reviewed. The most comprehensive texts in 
terms of coverage were Neave and Worthington (1988), which is out of print, and Deshpande, Gore, and
Shanubhogue (1995). Table 1 contains the statistical content of the eighteen textbooks. The comment by
Laubscher, Steffens, and De Lange (1968) on the Mood test summarized the findings: “As far as we 
know the main drawback in using this test statistic, developed more than 14 years ago, lies in the fact 
that its distribution has never been tabulated except for a few isolated cases” (p. 497). 



Table 1. Results of Survey of 18 Nonparametric Books.

Statistic                                    Number of Books That Included
                                             Tables of Critical Values

Single Population Tests
  Kolmogorov-Smirnov Goodness-of-Fit Test    11
  Sign Test                                  4
  Wilcoxon’s Signed Rank Test                14

Comparison of Two Populations
  Kolmogorov-Smirnov Two Sample Test         11
  Rosenbaum’s Test                           1
  Wilcoxon (Mann-Whitney) Test               14
  Mood Test                                  1
  Savage Test                                j
  Ansari-Bradley Test                        1

Comparison of Several Populations
  Kruskal-Wallis Test                        jq
  Friedman’s Test                            g
  Terpstra-Jonckheere Test                   5
  Page’s Test                                4
  Match Test for Ordered Alternatives        1

Tests of Association
  Spearman’s Rank Correlation Coefficient    12
  Kendall’s Rank Correlation Coefficient     10

Many nonparametric tests have large sample approximations that can be used as an alternative to 
tabulated critical values. These approximations are useful substitutes if the sample size is sufficiently
large, and hence, obviate the need for locating tables of critical values. However, there is no generally 
agreed upon definition of what constitutes a large sample size. Consider the Sign test and the Wilcoxon 
tests as examples. 

Regarding the Sign test, Hajek (1969) wrote, “The normal approximation is good for N > 12” 
(p. 108). Gibbons (1971) agreed, “Therefore, for moderate and large values of N (say at least 12) it is
satisfactory to use the normal approximation to the binomial to determine the rejection region” (p. 102).
Both Sprent (1989) and Deshpande, Gore, and Shanubhogue (1995), however, recommended n greater
than 20. Siegel and Castellan (1988) suggested n > 35, but Neave and Worthington (1988) proposed
that n > 50. 

The literature regarding the Wilcoxon Rank Sum test is similarly disparate. Deshpande, Gore, 
and Shanubhogue (1995) stated that the combined sample size should be at least 20 to use a large sample 
approximation of the critical value. Conover (1971) and Sprent (1989) recommended that one or both 
samples must exceed 20. Gibbons (1971) placed the lower limit at twelve per sample. For the Wilcoxon 
Signed Rank test, Deshpande, Gore, and Shanubhogue (1995) said that the approximation can be used 
when n is greater than 10. Gibbons (1971) recommended it when n is greater than 12, and Sprent (1989)
required n to be greater than 20. The general lack of agreement may indicate that these
recommendations are based on personal experience, the sample sizes in available tables, the author’s
definition of “acceptable” or “large”, or some other criterion.

There are two alternatives to tables and approximations. The first is to use exact permutation 
methods. There is software available that will generate exact p-values for small data sets and Monte 
Carlo estimates for larger problems. See Ludbrook and Dudley (1998) for a brief review of the 
capabilities of currently available software packages for permutation tests. However, these software 
solutions are expensive, have different limitations in coverage of procedures, and may require 
considerable computing time even with fast personal computers (see, e.g., Musial, 1999; Posch & 
Sawilowsky, 1997). In any case, a desirable feature of nonparametric statistics is that they are easy to 
compute without statistical software and computers, which makes their use in the classroom or work in 
the field attractive. 

A second alternative is the use of the rank transformation (RT) procedure developed by Conover 
and Iman (1981). They proposed the use of this procedure as a bridge between parametric and 
nonparametric techniques. The RT is carried out as follows: rank the original scores, perform the
classical test on the ranks, and refer to the standard table of critical values. In some cases, this procedure
results in a well-known test. For example, conducting the t test on the ranks of original scores in a two
independent samples layout is equivalent to the Wilcoxon Rank Sum test. (However, see the caution 
noted by Sawilowsky & Brown, 1991). In other cases, such as factorial analysis of variance (ANOVA) 
layouts, a new statistic emerges. 
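
The three RT steps described above (rank, apply the classical test to the ranks, consult the usual table) can be sketched in Python for the two independent samples layout. This is only an illustration under stated assumptions: the function names `midranks`, `pooled_t`, and `rank_transform_t` are hypothetical, ties receive average ranks, and the classical test is taken to be the pooled-variance t test. The cautions about the RT discussed in the surrounding text still apply.

```python
import math

def midranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in order[i:j]:
            ranks[k] = (i + 1 + j) / 2        # average of ranks i+1 .. j
        i = j
    return ranks

def pooled_t(a, b):
    """Classical two independent samples t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

def rank_transform_t(a, b):
    """Rank the combined scores, then run the t test on the ranks."""
    ranks = midranks(list(a) + list(b))
    return pooled_t(ranks[:len(a)], ranks[len(a):])

t = rank_transform_t([1, 2, 3], [4, 5, 6])    # about -3.674
```

The resulting t would then be referred to the standard t table with n1 + n2 − 2 degrees of freedom.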

The early exuberance with this procedure was related to its simplicity and promise of increased 
statistical power when data sets displayed nonnormality. Iman and Conover noted the success of the RT 
in the two independent samples case and the one-way ANOVA layout. Nanna (1997) showed that the 
RT is robust and powerful as an alternative to the independent samples multivariate Hotelling’s T².

However, Blair and Higgins (1985) demonstrated that the RT suffers power losses in the 
dependent samples t test layout as the correlation between the pretest and posttest increases. Bradstreet 
(1997) found the RT to perform poorly for the two sample Behrens-Fisher problem. Sawilowsky (1985), 
Sawilowsky, Blair, and Higgins (1989), Blair, Sawilowsky, and Higgins (1987), and Kelley and 
Sawilowsky (1997) showed the RT has severely inflated Type I errors and a lack of power in testing 
interactions in factorial ANOVA layouts. Harwell and Serlin (1997) found the RT to have inflated Type 
I errors in the test of β = 0 in linear regression. In the context of analysis of covariance, Headrick and
Sawilowsky (1999, 2000) found the RT’s Type I error rate inflates more quickly than in the general ANOVA
case, and it demonstrated more severely depressed power properties. Recent results by Headrick
(personal communications) show the RT to have poor control of Type I errors in the ordinary least
squares multiple regression layout. Sawilowsky (1989) stated that the RT as a bridge has fallen down, 
and cannot be used to unify parametric and nonparametric methodology or as a method to avoid finding 
formulas and critical values for nonparametric tests. 



The Current Study 

As noted above, the computational formulas for many nonparametric tests are scattered 
throughout the literature, and tables of critical values are scarcer. Large sample approximation formulas 
are also scattered and appear in different forms. Most important, the advice on how “large” a sample
must be to use the approximations is conflicting. The purpose of this study is to ameliorate all five of
these problems.

Ascertaining the smallest sample size that can be used with a large sample approximation for the 
various statistics would enable researchers who do not have access to the necessary tables of critical 
values or statistical software to employ these tests. The first portion of this paper uses Monte Carlo 
methods to determine the smallest sample size that can be used with the large sample approximation 
while still preserving nominal alpha. The second portion of this paper provides a comprehensive review 
of computational formulas with worked examples for twenty nonparametric statistics. They were chosen
because they are commonly employed and because large sample approximation formulas have been
developed for them.

Methodology 

Each of the twenty statistics was tested with normal data and Micceri’s (1989; see also 
Sawilowsky, Blair, & Micceri, 1990) real world data sets. The real data sets represent smooth 
symmetric, extreme asymmetric, and multi-modal lumpy distributions. Monte Carlo methods were used
in order to determine the smallest samples that can be used with large-sample approximations. 

A program was written in Fortran 90 (Lahey, 1998) for each statistic. The program sampled with 
replacement from each of the four data sets for $n = 1, 2, \ldots, N$; $(n_1, n_2) = (2, 2), (3, 3), \ldots, (N_1, N_2)$; and so
forth as the number of groups increased. The statistic was calculated and evaluated using the tabled
values when available and the approximation of the critical value. The number of rejections was counted
and the Type I error rate was computed. Nominal α was set at .05 and .01. Bradley’s (1978)
conservative estimates of .045 < Type I error rate < .055 and .009 < Type I error rate < .011 were used,
respectively, as measures of robustness. The sample sizes were increased until the Type I error rates 
converged within these acceptable regions. 
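
The simulation loop described above can be sketched as follows. This is a minimal Python illustration, not the Fortran 90 program used in the study. The names `type_i_error_rate`, `within_bradley_limits`, and `sign_test_rejects`, the synthetic population, the seed, and the number of replications are all assumptions made for the example. The Sign test with the normal approximation S* = (S − n/2)/√(n/4) stands in for whichever statistic is under study.

```python
import math
import random

def type_i_error_rate(population, n, reps, rejects, rng):
    """Draw `reps` samples of size n with replacement and return the rejection rate."""
    rejections = 0
    for _ in range(reps):
        sample = [rng.choice(population) for _ in range(n)]
        if rejects(sample):
            rejections += 1
    return rejections / reps

def within_bradley_limits(rate, alpha):
    """Conservative limits quoted in the text: .045 to .055 for .05, .009 to .011 for .01."""
    return 0.9 * alpha < rate < 1.1 * alpha

def sign_test_rejects(sample, median0=0.0, z_crit=1.95996):
    """Illustrative two-tailed Sign test using a normal approximation (assumed form)."""
    plus = sum(1 for x in sample if x > median0)
    minus = sum(1 for x in sample if x < median0)
    n = plus + minus                      # observations equal to median0 are dropped
    if n == 0:
        return False
    z = (min(plus, minus) - n / 2) / math.sqrt(n / 4)
    return z < -z_crit

rng = random.Random(2024)                 # fixed seed so the run is repeatable
population = list(range(-50, 51))         # hypothetical population with median 0
rate = type_i_error_rate(population, n=30, reps=2000,
                         rejects=sign_test_rejects, rng=rng)
print(rate, within_bradley_limits(rate, 0.05))
```

In a study of this kind the loop would be repeated for increasing n until the estimated rate stays inside the limits; far more replications than the 2000 used here would be needed for stable estimates.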

Assumptions and Limitations 

In many cases there are different formulas for the large sample approximation of a statistic. Two 
criteria were used in choosing which formula to include: (a) consensus of authors, and (b) ease of use in 
computing and programming. Some of the statistics have different large sample approximations based 
on the presence of ties among the data. The formulas not based on ties were used because we corrected 
for ties using average ranks. 

Data Sets For Worked Examples In This Article 

The worked examples in this study used five data sets that may be found in Table 3 (Appendix). 
Some statistics converged at relatively large sample sizes. In choosing the sample size for the worked 
example, we compromised between the amount of computation required for large samples and an 
unrepresentatively small but convenient sample size. Therefore, we selected a sample size of n = 15, 
recognizing that some statistics’ large sample approximations do not converge within Bradley’s (1978)
limits for this small sample size. The data sets were randomly selected from Micceri’s (1989)
multimodal lumpy data set (Table 4, Appendix). Because the samples came from the same population,
the worked examples all conclude that the null hypothesis cannot be rejected. 

Statistics Examined 

The twenty statistics included in this article represent four layouts: (1) single population tests, (2) 
comparison of two populations, (3) comparison of several populations, and (4) tests of association. 
Single-population tests included: (a) a goodness-of-fit test, (b) tests for location, and (c) an estimator of
the median. Comparisons of two populations included: (a) tests for general differences, (b) two-sample 
location problems, and (c) two-sample scale problems. Comparisons of several populations included: (a) 
ordered alternative hypotheses, and (b) tests of homogeneity against omnibus alternatives. Tests of 
association focused on rank correlation coefficients.

Results



Table 2 shows the minimum sample sizes for the tests studied. These recommendations are based 
on results that converged when underlying assumptions are reasonably met. The minimum sample sizes
are conservative, representing the largest minimum for each test. If the test had three or more samples,
the largest group minimum was chosen. Consequently, the large-sample approximations will work in
some instances for smaller sample sizes. Where the test involves more than one sample, the minimum
refers to the size of each of the equal-sized samples.



Table 2. Minimum Sample Size for Large-Sample Approximations.

Test                                                 α = .05        α = .01

Single Population Tests
  Kolmogorov-Smirnov Goodness-of-Fit Test            25 < n < 40    28 < n < 50
  Sign Test                                          n > 150        n > 150
  Wilcoxon Signed Rank Test                          10             22
  Estimator of Median for a Continuous Distribution  n > 150        n > 150

Comparison of Two Populations
  Kolmogorov-Smirnov Two Sample Test                 n > 150        n > 150
  Rosenbaum’s Test                                   16             20
  Tukey’s Test                                       10 < n < 18    21
  Wilcoxon (Mann-Whitney) Test                       15             29
  Hodges-Lehmann Estimator                           15             20
  Siegel-Tukey Test                                  25             38
  Mood Test                                          5              23
  Savage Test                                        11             31
  Ansari-Bradley Test                                16             29

Comparison of Several Populations
  Kruskal-Wallis Test                                11             22
  Friedman’s Test                                    13             23
  Terpstra-Jonckheere Test                           4              8
  Page’s Test (k > 4)                                11             18
  The Match Test for Ordered Alternatives (k > 3)    86             27

Tests of Association
  Spearman’s Rank Correlation Coefficient            12
  Kendall’s Rank Correlation Coefficient             14 < n < 24



Some notes and cautionary statements are in order with regard to the entries in Table 2. The
Monte Carlo methods were completed for n = 1, 2, ..., 150. The Kolmogorov-Smirnov goodness-of-fit
test was conservative for values below the minimum value stated and liberal for values above the
maximum value. Results for the Sign test indicate convergence for some distributions may occur close
to n = 150. The results for the confidence interval for the Estimator of the Median suggest convergence
may occur close to n = 150 only for normally distributed data. However, for the nonnormal data sets the
Type I error rates were quite conservative (e.g., for α = .05 the Type I error rate was only 0.01146 and
for α = .01 it was only 0.00291 for n = 150 and the extreme asymmetric data set).

The Kolmogorov-Smirnov two-sample test was erratic, with no indication convergence would be close to
150. Results for Tukey’s Test were conservative for α = .05 when the cutoff for the p-value was .05, and
fell within acceptable limits for some sample sizes when .055 was used as a cutoff. The Hodges-
Lehmann Estimator only converged for normal data. For nonnormal data the large sample
approximation was extremely conservative with n = 10 (e.g., for the extreme asymmetric data set the
Type I error rate was only 0.0211 and 0.0028 for the .05 and .01 alpha levels, respectively) and
increased in conservativeness (i.e., the Type I error rate converged to 0.0) as n increased. The Match test
only converged for normally distributed data, and it was the only test where the sample size required for
α = .01 was smaller than for α = .05.

Statistics, Worked Examples, Large Sample Approximations
Single Population Tests 

Goodness-of-fit statistics are single-population tests of how well observed data fit expected 
probabilities or a theoretical probability density function. They are often used as a preliminary test of the 
distribution assumption of parametric tests. The Kolmogorov-Smirnov goodness-of-fit test was studied.

Tests for location are used to make inferences about the location of a population. The measure of 
location is usually the median. If the median is not known but there is reason to believe that its value is 
$M_0$, then the null hypothesis is $H_0: M = M_0$. The tests for location studied were the Sign test,
Wilcoxon’s Signed Rank test, and the Estimator of the Median for a Continuous Distribution.

Kolmogorov-Smirnov Goodness-of-Fit Test

The Kolmogorov-Smirnov (K-S) goodness-of-fit statistic was devised by Kolmogorov in 1933
and Smirnov in 1939. It is a test of goodness-of-fit for continuous data, based on the maximum vertical
deviation between the empirical distribution function, $F_n(x)$, and the hypothesized cumulative
distribution function, $F_0(x)$. Small differences support the null hypothesis while large differences are
evidence against the null hypothesis.

The null hypothesis is $H_0: F_n(x) = F_0(x)$ for all $x$ and the alternative hypothesis is
$H_1: F_n(x) \ne F_0(x)$ for at least some $x$, where $F_0(x)$ is a completely specified continuous distribution. The
empirical distribution function, $F_n(x)$, is a step function, defined as:

\[ F_n(x) = \frac{\text{number of sample values} \le x}{n} \qquad (1) \]

where n = sample size.

Test statistic. 

The test statistic, $D_n$, is the maximum vertical distance between the empirical distribution
function and the cumulative distribution function.

\[ D_n = \max\left[ \max_i \left| F_n(x_i) - F_0(x_i) \right|, \; \max_i \left| F_n(x_{i-1}) - F_0(x_i) \right| \right] \qquad (2) \]

Both vertical distances $|F_n(x_i) - F_0(x_i)|$ and $|F_n(x_{i-1}) - F_0(x_i)|$ have to be calculated in order to
find the maximum deviation. The overall maximum of the two calculated deviations is defined as $D_n$.

For a one-tailed test against the alternatives $H_1: F_n(x) > F_0(x)$ or $H_1: F_n(x) < F_0(x)$ for at least
some values of $x$, the test statistics are respectively:

\[ D_n^{+} = \max\left[ F_n(x) - F_0(x) \right] \qquad (3) \]

or

\[ D_n^{-} = \max\left[ F_0(x) - F_n(x) \right] \qquad (4) \]

The rejection rule is to reject $H_0$ when $D_n > D_{n,\alpha}$, where $D_{n,\alpha}$ is the critical value for a given n and α level
of significance.

Large sample sizes.

The null distribution of $4n(D_n^{+})^2$ (or $4n(D_n^{-})^2$) is approximately $\chi^2$ with 2 degrees of freedom. Thus,
the large sample approximation is

\[ D_{n,\alpha} \approx \frac{1}{2} \sqrt{\frac{\chi^2_{\alpha,2}}{n}} \qquad (5) \]

where $\chi^2_{\alpha,2}$ is the value for chi-square with 2 degrees of freedom for the appropriate alpha level and n is
the sample size.

Example. 

The K-S goodness-of-fit statistic was calculated for Sample 1 in Table 3 (Appendix), n = 15,
against the cumulative frequency distribution of the multimodal lumpy data set. The maximum
difference at step was 0.07463 and the maximum difference before step was 0.142610. Thus the value of
$D_n$ is 0.142610. For a two-tail test with α = .05, the large sample approximation is $1.3581/\sqrt{n} =
1.3581/\sqrt{15} = 0.35066$. Because 0.142610 < 0.35066, the null hypothesis cannot be rejected.
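
The statistic and its large-sample critical value can be computed along the following lines. This is a hedged sketch: the function names `ks_statistic` and `ks_critical_value` are hypothetical, the critical value is taken to be one half of √(χ²/n) as implied by the statement that 4nD² is approximately chi-square with 2 degrees of freedom, and the two-tailed test at α = .05 is assumed to use the upper .025 point, which reproduces the constant 1.3581 in the worked example.

```python
import math

def ks_statistic(sample, cdf0):
    """D_n: largest gap between the empirical cdf and cdf0, checked at and just before each step."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf0(x)
        d = max(d, abs(i / n - f0), abs((i - 1) / n - f0))
    return d

def ks_critical_value(n, chi2_2df):
    """Large-sample critical value: 0.5 * sqrt(chi2 / n)."""
    return 0.5 * math.sqrt(chi2_2df / n)

# With 2 degrees of freedom the upper-tail chi-square point is -2 ln(p).
chi2 = -2.0 * math.log(0.025)            # two-tailed alpha = .05
crit = ks_critical_value(15, chi2)       # about 0.3507, i.e. 1.3581 / sqrt(15)

# Toy check against a uniform(0, 1) cdf; these three values are made up.
d_toy = ks_statistic([0.1, 0.4, 0.7], lambda x: x)
print(crit, d_toy, d_toy > crit)
```

The closed form for the chi-square point holds only for 2 degrees of freedom, which is the only case needed here.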

The Sign Test 

The Sign test is credited to Fisher as early as 1925. One of the first papers on the theory and 
application of the sign test is attributed to Dixon and Mood in 1946 (Hollander & Wolfe, 1973). 
According to Neave and Worthington (1988), the logic of the Sign test is “almost certainly the oldest of 
all formal statistical tests as there is published evidence of its use long ago by J. Arbuthnott (1710)!” (p. 
65). 

The Sign test is a test for a population median. It can also be used with matched data as a test for 
equality of medians. The test is based upon the number of values above or below the hypothesized 
median. Gibbons (1971) referred to the sign test as the nonparametric counterpart of the one-sample t
test. The sign test tests the null hypothesis $H_0: M = M_0$, where $M$ is the population median and $M_0$ is its
hypothesized value, against the alternative hypothesis $H_1: M \ne M_0$. One-tailed test
alternative hypotheses are of the form $H_1: M < M_0$ and $H_1: M > M_0$.

Procedure. 

Each $x_i$ is compared with $M_0$. If $x_i > M_0$ then a plus sign ‘+’ is recorded. If $x_i < M_0$ then a minus
sign ‘−’ is recorded. In this way all data are reduced to ‘+’ and ‘−’ signs.

Test statistic. 

The test statistic is the number of ‘+’ signs or the number of ‘−’ signs. If the expectation under
the alternative hypothesis is that there will be a preponderance of ‘+’ signs, the test statistic is the
number of ‘−’ signs. Similarly, if the expectation is a preponderance of ‘−’ signs, the test statistic is the
number of ‘+’ signs. If the test is two-tailed, use the smaller of the two. Thus,

S = the number of ‘+’ or ‘−’ signs (depending upon the context)    (6)



Large sample sizes.

The large sample approximation is given by

\[ S^{*} = \frac{S - \frac{n}{2}}{\sqrt{\frac{n}{4}}} \qquad (7) \]

where S is the test statistic and n is the sample size. S* is compared to the standard normal z scores for
the appropriate α level.

Example.

The Sign test was calculated using Sample 1 in Table 3 (Appendix), n = 15. The population
median is 18.0. The number of negative values is 7 and the number of positive values is 8. Therefore S =
7. The large sample approximation, S*, using formula (7) is -.258199. Because -.258199 > -1.95996, the
null hypothesis cannot be rejected.
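
The arithmetic of this example can be checked with a few lines of Python. This is a sketch under one assumption: the approximation is taken to be S* = (S − n/2)/√(n/4), the form that reproduces the value −.258199 reported above. The function names `sign_counts` and `sign_test_z` are hypothetical.

```python
import math

def sign_counts(sample, median0):
    """Return (number of '+' signs, number of '-' signs) relative to median0."""
    plus = sum(1 for x in sample if x > median0)
    minus = sum(1 for x in sample if x < median0)
    return plus, minus

def sign_test_z(s, n):
    """Normal approximation for the Sign test statistic S with sample size n."""
    return (s - n / 2) / math.sqrt(n / 4)

z = sign_test_z(7, 15)            # about -0.258199
reject = z < -1.95996             # two-tailed test at alpha = .05
print(z, reject)
```

Because S is the smaller of the two counts in a two-tailed test, z is never positive, so only the lower critical value needs to be checked.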



Wilcoxon’s Signed Rank Test 

Wilcoxon’s Signed Rank test was introduced by Wilcoxon in 1945. The statistic uses the ranks 
of the absolute differences between $x_i$ and $M_0$ along with the sign of the difference. Thus it uses the relative
magnitudes of the data. This statistic can also be used to test for symmetry and to test for equality of
location for paired replicates.

The null hypothesis is $H_0: M = M_0$ against the alternative $H_1: M \ne M_0$. The alternative may
also be one-sided, $H_1: M > M_0$ or $H_1: M < M_0$.

Procedure. 

Compute the differences, $D_i$, by the formula

\[ D_i = x_i - M_0 \qquad (8) \]

Rank the absolute value of the differences, in ascending order, keeping track of the individual signs.

Test statistic.

The test statistic is the sum of either the positive ranks or the negative ranks. If the alternative
hypothesis suggests that the sum of the positive ranks should be large, then

\[ T = \text{the sum of ranks of the negative differences} \qquad (9) \]

If the alternative hypothesis suggests that the sum of the negative ranks should be large, then

\[ T = \text{the sum of ranks of the positive differences} \qquad (10) \]

For a two-tailed test, T is the smaller of the two rank-sums. The total sum of the ranks is $\frac{n(n+1)}{2}$, which
gives the following relationship:

\[ \text{sum of positive ranks} = \frac{n(n+1)}{2} - \text{sum of negative ranks} \qquad (11) \]

Large sample sizes.

The large sample approximation is

\[ z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} \qquad (12) \]

where T is the test statistic and n is the sample size. The resulting z is compared to the standard normal z
for the appropriate alpha level.

Example. 

The Signed Rank test was computed using the data from Sample 1 in Table 3 (Appendix), n =
15. The median of the population is 18.0. Tied differences were assigned midranks. The sum of the
negative ranks was 38.5 and the sum of the positive ranks was 81.5. Therefore the Signed Rank statistic
is 38.5. The large sample approximation is $\frac{38.5 - 60}{\sqrt{310}} = \frac{-21.5}{17.6068} = -1.22112$. Because -1.22112 > -1.95996,
the null hypothesis is not rejected.
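
The ranking procedure and the approximation can be sketched as follows. The names `signed_rank_sums` and `signed_rank_z` are hypothetical, and dropping zero differences is an assumption of this sketch rather than something stated in the text. The z formula used is (T − n(n + 1)/4) divided by √(n(n + 1)(2n + 1)/24), which reproduces the figures in the example (60 and √310 for n = 15).

```python
import math

def signed_rank_sums(sample, median0):
    """Return (sum of positive ranks, sum of negative ranks), using midranks for ties."""
    diffs = [x - median0 for x in sample if x != median0]   # zero differences dropped (assumption)
    ordered = sorted(diffs, key=abs)
    pos = neg = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        midrank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        for d in ordered[i:j]:
            if d > 0:
                pos += midrank
            else:
                neg += midrank
        i = j
    return pos, neg

def signed_rank_z(t, n):
    """Normal approximation for the Signed Rank statistic T."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean) / sd

z = signed_rank_z(38.5, 15)                # about -1.22112
sums = signed_rank_sums([1, -2, 2, 4], 0)  # toy data: (7.5, 2.5)
print(z, sums)
```

The two rank sums always add up to n(n + 1)/2, which gives a quick check on the ranking step.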



Estimator of the Median for a Continuous Distribution

The sample median is the point estimate of the population median. This procedure provides a 1 −
α confidence interval for the population median. It was designed to be used with continuous data.

Procedure.

Let n be the size of the sample. Order the n observations in ascending order,
$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Let $x_{(0)} = -\infty$ and $x_{(n+1)} = \infty$. These n + 2 values form n + 1 intervals
$(x_{(0)}, x_{(1)}), (x_{(1)}, x_{(2)}), \ldots, (x_{(n)}, x_{(n+1)})$. The $i$th interval is defined as $(x_{(i-1)}, x_{(i)})$ with
$i = 1, 2, \ldots, n, n + 1$. The probability that the median is in any one interval is based on the binomial distribution.
The confidence interval for the median given the confidence coefficient 1 − α requires that an r be found
such that the sum of the probabilities of the intervals in both the lower and upper ends give the best
conservative approximation of α/2, according to the following:

\[ \sum_{j=0}^{r} \binom{n}{j} \frac{1}{2^n} \le \frac{\alpha}{2}, \qquad \sum_{j=n-r}^{n} \binom{n}{j} \frac{1}{2^n} \le \frac{\alpha}{2} \qquad (13) \]

Thus $(x_{(r)}, x_{(r+1)})$ is the last interval in the lower end, making $x_{(r+1)}$ the lower limit of the confidence
interval. By a similar process, $x_{(n-r)}$ is the upper limit of the confidence interval.

Large sample sizes. 

According to Deshpande, Gore, and Shanubhogue (1995) “one may use the critical points of the 
standard normal distribution, to choose the value of r + 1 and n − r, in the following way”: r + 1 is the
integer closest to

\[ \frac{n}{2} - z_{\alpha/2} \sqrt{\frac{n}{4}} \qquad (14) \]

where $z_{\alpha/2}$ is the upper α/2 critical value of the standard normal distribution.

Example. 

The data from Sample 1 in Table 3 (Appendix), n = 15, were used to compute the estimator of
the median. The population median is 18.0. For the given n and α = .05, the value of r is 3. The value of
r + 1 is 4, and n − r is 12. The 4th value is 13 and the 12th value is 33. Therefore the interval is (13, 33).
The large sample approximation yields 7.5 − 1.95996(1.9365) = 7.5 − 3.80 = 3.70. The closest integer is
r + 1 = 4, so r = 3 and n − r = 12, resulting in the same interval, (13, 33). The interval contains the
population median, 18.
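
Both routes to r can be verified numerically. The sketch below makes two assumptions: the exact r is read as the largest r whose lower-tail binomial probability does not exceed α/2, and the approximation is taken to be the integer closest to n/2 − z√(n/4), the form used in the arithmetic above. The function names `exact_r` and `approx_r_plus_1` are hypothetical.

```python
import math

def exact_r(n, alpha):
    """Largest r with sum_{j=0}^{r} C(n, j) / 2^n <= alpha / 2, or None if there is none."""
    total = 0
    r = None
    for j in range(n + 1):
        total += math.comb(n, j)
        if total / 2 ** n <= alpha / 2:
            r = j
        else:
            break
    return r

def approx_r_plus_1(n, z):
    """Integer closest to n/2 - z * sqrt(n/4)."""
    return round(n / 2 - z * math.sqrt(n / 4))

r = exact_r(15, 0.05)                  # 3
r1 = approx_r_plus_1(15, 1.95996)      # 4
print(r, r1, (r1, 15 - (r1 - 1)))      # order statistics used: 4th and 12th
```

For n = 15 the lower-tail probability is 576/32768 at r = 3 and 1941/32768 at r = 4, so r = 3 is the last value that stays under .025.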
Two Sample Problems

The two-sample problem consists of two independent random samples drawn from two 
populations. This study examined two sample tests for general differences, two sample location 
problems, and two sample scale problems. 

When differences between two samples are not expected to be predominantly differences in 
location or differences in scale, a test for general differences is appropriate. Generally, differences in
variability are related to differences in location. Two tests for differences were considered, the
Kolmogorov-Smirnov test for general differences and Rosenbaum’s test.

Two sample location problems involve tests for a difference in location between two samples 
when the populations are assumed to be similar in shape. The idea is that $f_1(x) = f_2(x + \theta)$ or
$f_1(x) = f_2(x - \theta)$ where $\theta$ is the distance between the population medians. Tukey’s quick test, the
Wilcoxon (Mann-Whitney) statistic, and the Hodges-Lehmann estimator of the difference in location for
two populations were considered. 

In two sample scale problems, the population distributions are usually assumed to have the same 
location with different spreads. However, Neave and Worthington (1988) cautioned that tests for 
difference in scale could be severely impaired if there is a difference in location as well. The following 
nonparametric tests for scale were studied: the Siegel-Tukey test, the Mood test, the Savage test for 
positive random variables, and the Ansari-Bradley test. 

Kolmogorov-Smirnov Test for General Differences

The Kolmogorov-Smirnov test compares the cumulative distribution frequencies of the two
samples to test for general differences between the populations of the samples. The sample cdf “is an
approximation of the true cdf of the corresponding population - though, admittedly, a rather crude one if
the sample size is small” (Neave & Worthington, 1988, p. 149). This property was used in the goodness-
of-fit test above. Large differences in the sample cdfs can indicate a difference in the population cdfs,
which could be due to differences in location, spread, or more general differences in the distributions.
The null hypothesis is $H_0: F_1(x) = F_2(x)$ for all $x$ and the alternative hypothesis is $H_1: F_1(x) \ne F_2(x)$ for
some $x$.

Procedure.



The combined observations are ordered from smallest to largest, keeping track of the sample
membership. Above each score, write the cdf of sample 1, and below each score write the cdf of sample
2. Because the samples are of equal sizes, it is only necessary to use the numerator of the cdf. For
example, $\mathrm{cdf}(x_i) = i/n$. Then write $i$ above the score for sample 1. Find the largest difference between the cdf
for sample 1 and the cdf for sample 2.

Test statistic. 

The test statistic is $D^{*}$. $D^{*} = mnD$, and $D^{*} = n^2 D$ for equal sample sizes. The above procedure
yields $nD$. Thus

\[ D^{*} = n(nD) \qquad (15) \]

The greatest difference found by the procedure is multiplied by the sample size.

Large sample sizes. 

As sample size increases, the distribution is approximately chi-squared with 2 degrees of 
freedom, as it is for the goodness-of-fit test. The large sample approximation for D is 

$$D = \frac{1}{2}\sqrt{\frac{\chi^2_{\alpha}(m + n)}{mn}} \tag{16}$$

where $\chi^2_{\alpha}$ is the value for chi-square with 2 degrees of freedom for the appropriate alpha 
level and n, m are the two sample sizes. The resulting D is used in formula (15). 

Example. 

This example used the data from Sample 1 and Sample 5 in Table 3 (Appendix), n = m = 15. The 
greatest difference (nD) between the cdfs of the two samples is nD = 3. Therefore $D^* = 15(3) = 45$. The 
large sample approximation is $15^2(1.3581)\sqrt{2/15} = 225(1.3581)(.365148) = 111.579301$. Because 
45 < 111.579301, the null hypothesis cannot be rejected. 
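The count-based shortcut and formula (16) can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: the function names are hypothetical, the closed form -2 ln(p) is the standard upper quantile of a chi-square variable with 2 degrees of freedom, and the tail probability .025 is an assumption chosen because it reproduces the constant 1.3581 used in the example.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def ks_count_difference(x, y):
    """Largest absolute difference between the cdf numerators (equal sizes only)."""
    if len(x) != len(y):
        raise ValueError("this shortcut assumes equal sample sizes")
    best = 0
    for v in sorted(set(x) | set(y)):
        count_x = sum(1 for a in x if a <= v)
        count_y = sum(1 for b in y if b <= v)
        best = max(best, abs(count_x - count_y))
    return best

def ks_large_sample_critical(m, n, tail_prob):
    """Critical value for D* = mnD, with D taken from formula (16)."""
    chi2 = -2.0 * math.log(tail_prob)  # upper quantile, chi-square with 2 df
    d = 0.5 * math.sqrt(chi2 * (m + n) / (m * n))
    return m * n * d

n_d = ks_count_difference(sample1, sample5)
d_star = len(sample1) * n_d
critical = ks_large_sample_critical(15, 15, 0.025)  # assumed tail probability
print(n_d, d_star, round(critical, 3))
```

Working with the numerators keeps the arithmetic in integers until the final comparison, which mirrors the hand procedure described for equal sample sizes.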



Rosenbaum’s Test 

Rosenbaum’s test, which was developed in 1965, is useful in situations where an increase in the 
measure of location implies an increase in variation. It is a quick and easy test based on the number of 
observations in one sample greater than the largest observation in the other sample. 

The null hypothesis is that both populations have the same location and spread, against the 
alternative that the populations differ in location and spread. 

Procedure. 

The largest observation in each sample is identified. If the largest overall observation is from 
sample 1, then the observations from sample 1 which are greater than the largest observation 
from sample 2 are counted. If the largest overall observation is from sample 2, then the 
observations from sample 2 which are greater than the largest observation from sample 1 are counted. 

Test statistic. 

The test statistic is the count of the extreme observations. R is the number of observations from 
sample 1 greater than the largest observation in sample 2 or the number of observations from sample 2 
greater than the largest observation in sample 1. 

Large sample sizes. 

As sample sizes increase, $n/N \to p$, where N is the combined sample size, and the probability 
that the number of extreme values equals h approaches $p^h$. 

Example. 

Rosenbaum’s statistic was calculated using Samples 1 and 5 in Table 3 (Appendix), $n_1 = n_2 = 15$. 
The maximum value from Sample 1 is 39, and from Sample 5, 33. There are three values from Sample 1 
greater than 33, namely 34, 36, and 39. Hence R = 3. The large sample approximation is $(.5)^3 = 0.125$. 
Because 0.125 > .05, the null hypothesis cannot be rejected. 
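A short Python sketch of the count and its large sample approximation is given here. The function name is hypothetical, and taking p as the share of the combined sample belonging to the sample that holds the overall maximum is an assumption (with equal sample sizes it gives p = .5, as in the example).

```python
# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def rosenbaum(x, y):
    """Return (R, p**R): the count of extreme values and its approximate probability."""
    if max(x) < max(y):
        x, y = y, x                        # make x the sample holding the overall maximum
    largest_other = max(y)
    r = sum(1 for v in x if v > largest_other)
    p = len(x) / (len(x) + len(y))         # assumed meaning of p
    return r, p ** r

r_stat, approx = rosenbaum(sample1, sample5)
print(r_stat, approx)
```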

Tukey’s Quick Test 

Tukey published a quick and easy test for the two sample location problem in 1959. It is easy to 
calculate and in most cases does not require the use of tables. The most common one-tailed critical 
values are 6 (α = .05) and 9 (α = .01) for most sample sizes. The statistic is based on the sum of the 
extreme runs. If there is a difference in location between samples X and Y, one would expect more X’s 
at one end and Y’s at the other end when the combined samples are ordered. 

Procedure. 

The combined samples can be ordered, but it is only necessary to order the largest and smallest 
elements. If both the maximum and minimum value come from the same sample, the test is finished, the 
value of $T_y$ = 0, and the null hypothesis is not rejected. 

For the one-tailed test, the lower end run should come from the sample expected to have the 
lower median and the upper run from the sample expected to have the larger median. For a two-tailed 
test, it is possible to proceed with the test as long as the maximum and minimum come from different 
samples. 

Test statistic. 

$T_y$ is defined as follows for $H_1: M_y > M_x$. $T_y$ is the number of X’s less than the smallest value 
of Y plus the number of Y’s greater than the largest value of X. If $H_1: M_x > M_y$, then the samples are 
reversed. For the two-tailed hypothesis both possibilities are considered. 

Critical values. 

As stated above, generally, the critical value for α = .05 is 6, and is 9 for α = .01. There are 
tables available. As long as the ratio of $n_x$ to $n_y$ is within 1 to 1.5, these critical values work well. There 
are corrections available when the ratio exceeds 1.5. For a two-tailed test the critical values are 7 
(α = .05) and 10 (α = .01). 

Large sample sizes. 

The null distribution is based on the order of the elements of both samples at the extreme ends. It 
does not depend upon the order of the elements in the middle. The formula for the probability that 
$T_y \geq h$ is the sum of a finite geometric series, 

$$\mathrm{Prob}(T_y \geq h) = \frac{pq(q^h - p^h)}{q - p} \tag{17}$$

When the sample sizes are equal, p = q = .5. Then the probability of $T_y \geq h$ is $h \cdot 2^{-(h+1)}$. 
For a two-tailed test the probability is doubled. 

Example. 

The Tukey test was calculated using the data in Sample 1 and Sample 5 in Table 3 (Appendix), n 
= m = 15. The maximum value, 39, is from Sample 1 and the minimum, 2, is from Sample 5 so the test 
may proceed. The value of $T_y$ = 1 + 3 = 4. For a two-tailed test with α = .05, the large sample 
approximation is $2(4)(2^{-5}) = 0.25$. Because 0.25 > .05, the null hypothesis cannot be rejected. 
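The extreme-run count and the tail probability of formula (17) can be sketched in Python as below. The function names are hypothetical, and the code assumes that p and q = 1 - p are the two samples' shares of the combined sample, so that equal sizes give p = q = .5.

```python
# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def tukey_quick(low, high):
    """T for the alternative that `high` has the larger median (0 if the extremes disagree)."""
    if not (min(low) < min(high) and max(high) > max(low)):
        return 0
    lower_run = sum(1 for v in low if v < min(high))
    upper_run = sum(1 for v in high if v > max(low))
    return lower_run + upper_run

def tukey_tail_prob(h, p=0.5):
    """Prob(T >= h) from formula (17); uses the limiting form h * 2**-(h+1) when p = q."""
    q = 1.0 - p
    if abs(q - p) < 1e-12:
        return h * 0.5 ** (h + 1)
    return p * q * (q ** h - p ** h) / (q - p)

t_stat = tukey_quick(sample5, sample1)      # minimum is in Sample 5, maximum in Sample 1
two_tailed = 2 * tukey_tail_prob(t_stat)
print(t_stat, two_tailed)
```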

Wilcoxon (Mann-Whitney) Statistic 

In 1945, Wilcoxon introduced the Rank Sum test at the same time as the Signed Rank test. Mann 
and Whitney introduced a different version of the test in 1947. The Wilcoxon statistic is easily converted 
to the Mann-Whitney U statistic. The hypotheses of the test are $H_0: F_1(x) = F_2(x)$ for all x against the 
two-tailed alternative, $H_1: F_1(x) \neq F_2(x)$. The one-tailed alternative is $H_1: F_1(x) = F_2(x + \theta)$. 

Procedure. 

For the Wilcoxon test, the combined samples are ordered, keeping track of sample membership. 
The ranks of the sample that is expected, under the alternative hypothesis, to have the smallest sum, are 
added. The Mann-Whitney test is as follows. Put all the observations in order, noting sample 
membership. For each observation in one sample, count how many of the observations of the other 
sample exceed it. The sum of these counts is the test statistic, U. 

Test statistic. 

For the Wilcoxon test, 

$$S_n = \sum_{i=1}^{n} R_i \tag{18}$$

where $R_i$ are the ranks of sample n and $S_n$ is the sum of the ranks of the sample expected to have 
the smaller sum. 

For the Mann-Whitney test, calculate the U statistic for the sample that is expected to have the 
smaller sum under the alternative hypothesis. 

$U_m$ = the sum of the counts of observations in n exceeding each observation in m    (19) 

$U_n$ = the sum of the counts of observations in m exceeding each observation in n    (20) 

There is a linear relation between $S_n$ and $U_n$. It is expressed as 

$$U_m = S_m - \frac{1}{2}m(m + 1) \tag{21}$$

and similarly, 

$$U_n = S_n - \frac{1}{2}n(n + 1) \tag{22}$$

where 

$$U_m = mn - U_n \tag{23}$$

In a two-tailed test, use the smallest U statistic to test for significance. 


Large sample sizes. 

The large-sample approximation using the Wilcoxon statistic, $S_n$, is: 

$$z = \frac{S_n + \frac{1}{2} - \frac{n(n + m + 1)}{2}}{\sqrt{\frac{mn(m + n + 1)}{12}}} \tag{24}$$

The large-sample approximation with the U statistic is 

$$z = \frac{U + \frac{1}{2} - \frac{1}{2}mn}{\sqrt{\frac{mn(m + n + 1)}{12}}} \tag{25}$$

In either case, reject $H_0$ if $z < -z_{\alpha}$ (or $z < -z_{\alpha/2}$ for a two-tailed test). 

Example. 

The Wilcoxon (Mann-Whitney, Rank Sum) statistic was calculated with data from Sample 1 and 
Sample 5 in Table 3 (Appendix), n = m = 15. The combined samples were ranked, using midranks for 
ties. The rank sum for Sample 1 was 258.5 and for Sample 5, 206.5. Hence S = 206.5. Calculating the U 
statistic, U = 206.5 - 0.5(15)(16) = 86.5. The large sample approximation for the U statistic is 

$$z = \frac{86.5 + .5 - .5(15^2)}{\sqrt{\frac{15^2(31)}{12}}} = \frac{-25.5}{24.1091} = -1.05769$$

Because -1.05769 > -1.95996, the null hypothesis cannot be rejected. 
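The rank sums, the U statistic, and the approximation in formula (25) can be reproduced with a short Python sketch. The helper names are hypothetical; midranks are used for ties, as in the example.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def midranks(values):
    """Ranks 1..N; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (S_x, S_y, U, z), taking U from the sample with the smaller rank sum."""
    m, n = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    s_x, s_y = sum(ranks[:m]), sum(ranks[m:])
    s, size = (s_x, m) if s_x <= s_y else (s_y, n)
    u = s - size * (size + 1) / 2
    z = (u + 0.5 - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)
    return s_x, s_y, u, z

s1, s5, u_stat, z = rank_sum_test(sample1, sample5)
print(s1, s5, u_stat, round(z, 5))
```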



Hodges-Lehmann Estimator of the Difference in Location 

When a difference in location exists, it may be appropriate to develop an estimate of the 
difference. Suppose there are two populations that are assumed to have similarly shaped distributions, but 
have different locations. The problem is to develop a confidence interval that will have the probability of 
1 - α that the actual difference lies in the interval. 

Procedure. 

All the pairwise differences, $x_i - y_j$, are computed. For sample sizes of m and n, there are mn 
differences. The differences are put in ascending order. The task is to find two integers l and u such that 
the probability that the difference lies between the l-th and u-th ordered differences is equal to 1 - α. 
These limits are chosen symmetrically. The appropriate lower tail critical value is found for the 
Mann-Whitney U statistic. This value is the upper limit of the lower end of the differences. Therefore l is 
the next consecutive integer. The upper limit of the confidence interval is the l-th difference from the 
upper end. Using the relationship l + u = mn + 1, u = mn - l + 1. The interval from the l-th to the u-th 
ordered difference is the confidence interval for the difference in location for the two populations. 

Large sample sizes. 

“l and u may be approximated by 

$$l = \left[\frac{mn}{2} - z_{\alpha/2}\sqrt{\frac{mn(m + n + 1)}{12}} - \frac{1}{2}\right] \tag{26}$$

$$u = \left[\frac{mn}{2} + z_{\alpha/2}\sqrt{\frac{mn(m + n + 1)}{12}} + \frac{1}{2}\right] \tag{27}$$

where the square brackets denote integer nearest to the quantity within, and $z_{\alpha/2}$ is the suitable upper 
critical point of the standard normal distribution” (Deshpande, Gore, & Shanubhogue, 1995, p. 45). 

Example. 

The Hodges-Lehmann estimate of the difference in location was computed using Samples 1 and 
5 in Table 3 (Appendix), n = m = 15. All possible differences were computed and ranked. Using the 
large sample approximation formula (26), l = 112.5 - 1.95996(24.109) - .5 = 64.747. Thus l = 65 and 
the lower bound is the 65th difference, -4. The upper bound is the 65th difference from the upper end, or 
the 225 - 65 + 1 = 161st value, 14. The confidence interval is (-4, 14). 
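The interval can be checked with the following Python sketch, which forms all mn differences, applies formula (26) for l, and uses u = mn - l + 1. The function name is hypothetical, and the default critical value 1.959964 (the standard normal upper .025 point) is an assumption about the intended α = .05 two-sided interval.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def hodges_lehmann_interval(x, y, z_crit=1.959964):
    """Return (l, u, lower bound, upper bound) from the ordered pairwise differences."""
    m, n = len(x), len(y)
    diffs = sorted(a - b for a in x for b in y)
    spread = z_crit * math.sqrt(m * n * (m + n + 1) / 12)
    l = max(1, round(m * n / 2 - spread - 0.5))   # formula (26), nearest integer
    u = m * n - l + 1                              # from l + u = mn + 1
    return l, u, diffs[l - 1], diffs[u - 1]

result = hodges_lehmann_interval(sample1, sample5)
print(result)
```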



Siegel-Tukey Test 

In 1960, the Siegel-Tukey test was developed, which is similar in procedure to the Wilcoxon 
Rank Sum test for difference in location. This test is based upon the logic that if two samples come from 
populations with the same median, the one with the greater variability will have more extreme scores. 
An advantage of the Siegel-Tukey statistic is that it uses the Wilcoxon table of critical values or can be 
transformed into a U statistic for use with the Mann-Whitney U table of critical values. 

The hypotheses for a two-tailed test are: 

$H_0$: There is no difference in spread between the two populations 

$H_1$: There is some difference in spread between the two populations 

Procedure. 

The two combined samples are ordered, keeping track of sample membership. The ranking 
proceeds as follows: the lowest observation is ranked 1, the highest is ranked 2, and the next highest 3. 
Then the second lowest is ranked 4 and the subsequent observation ranked 5. The ranking continues to 
alternate from lowest to highest, ranking two scores at each end. If there is an odd number of scores, the 
middle score is discarded and the sample size reduced accordingly. Below is an illustration of the 
ranking procedure. 

1 4 5 8 9 ... N ... 7 6 3 2 

where N = n + m. 

Test statistic. 

The sum of ranks is calculated for one sample. The rank sum can be used with a table of critical 
values or it can be transformed into a U statistic by the following formula. 

$$U' = R_n - \frac{1}{2}n(n + 1) \tag{28}$$

or 

$$U' = R_m - \frac{1}{2}m(m + 1) \tag{29}$$

Large sample sizes. 

The large-sample approximations are the same for the Siegel-Tukey test as for the Wilcoxon 
Rank Sum or the Mann-Whitney U statistic, formulas (24) and (25). 

Example. 



The Siegel-Tukey statistic was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), n 
= m = 15. The samples were combined and ranked according to the method described. Then tied ranks 
were averaged. The sum of ranks was 220.5 for Sample 1, and 244.5 for Sample 5. The U statistic is 
220.5 - .5(15)(16) = 100.5. The large sample approximation is 

$$z = \frac{100.5 + .5 - .5(15^2)}{\sqrt{\frac{15^2(31)}{12}}} = \frac{-11.5}{24.109127} = -0.476998$$

Because -0.476998 > -1.95996, the null hypothesis cannot be rejected. 
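The alternating ranking scheme is easy to get wrong by hand, so a Python sketch may help. The function names are hypothetical; tied observations receive the average of the Siegel-Tukey ranks they occupy, and a middle observation is dropped when N is odd, as described in the procedure.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def siegel_tukey_position_ranks(total):
    """Rank for each ascending position: 1 lowest, 2-3 highest two, 4-5 next lowest two, ..."""
    ranks = [0] * total
    lo, hi, rank = 0, total - 1, 1
    ranks[lo] = rank
    lo, rank, from_low = lo + 1, rank + 1, False
    while lo <= hi:
        for _ in range(2):
            if lo > hi:
                break
            if from_low:
                ranks[lo], lo = rank, lo + 1
            else:
                ranks[hi], hi = rank, hi - 1
            rank += 1
        from_low = not from_low
    return ranks

def siegel_tukey_rank_sums(x, y):
    """Return [rank sum of x, rank sum of y] with tied ranks averaged."""
    labeled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    if len(labeled) % 2 == 1:
        del labeled[len(labeled) // 2]          # discard the middle score
    pos_ranks = siegel_tukey_position_ranks(len(labeled))
    sums, i = [0.0, 0.0], 0
    while i < len(labeled):
        j = i
        while j + 1 < len(labeled) and labeled[j + 1][0] == labeled[i][0]:
            j += 1
        avg = sum(pos_ranks[i:j + 1]) / (j - i + 1)
        for pos in range(i, j + 1):
            sums[labeled[pos][1]] += avg
        i = j + 1
    return sums

sums = siegel_tukey_rank_sums(sample1, sample5)
u_stat = sums[0] - 15 * 16 / 2
z = (u_stat + 0.5 - 15 * 15 / 2) / math.sqrt(15 * 15 * 31 / 12)
print(sums, u_stat, round(z, 6))
```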



The Mood Test 

In 1954, the Mood test was developed based on the sum of squared deviations of one sample's 
ranks from the average combined ranks. The null hypothesis is that there is no difference in spread 
against the alternative hypothesis that there is some difference. 

Procedure. 

Let sample 1 be $x_1, x_2, \ldots, x_m$ and let sample 2 be $y_1, y_2, \ldots, y_n$. Arrange the combined samples in 
ascending order and rank the observations from 1 to m + n. Let $R_i$ be the rank of $x_i$. Let N = m + n. If N 
is odd, the middle rank is ignored to preserve symmetry. 

Test statistic. 

The test statistic is 

$$M = \sum_{i=1}^{m}\left(R_i - \frac{m + n + 1}{2}\right)^2 \tag{30}$$

Large sample sizes. 

The large sample approximation is 

$$z = \frac{M - \frac{m(N^2 - 1)}{12}}{\sqrt{\frac{mn(N + 1)(N^2 - 4)}{180}}} \tag{31}$$

where N = m + n and M is the test statistic. 

Example. 

The Mood statistic was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), n = m = 
15. The combined samples are ranked, with midranks assigned to ties. The overall mean of the ranks is 
15.5, and the sum of squared deviations of the ranks from the mean for Sample 1 was calculated, 
yielding M = 1257. The large sample approximation is 

$$z = \frac{1257 - 1123.75}{\sqrt{34720}} = \frac{133.25}{186.333} = 0.71512$$

Because 0.71512 < 1.95996, the null hypothesis cannot be rejected. 
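A Python sketch of formulas (30) and (31) follows. The function names are hypothetical, and the sketch assumes N is even, so no middle rank has to be dropped.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def midranks(values):
    """Ranks 1..N; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mood_test(x, y):
    """Return (M, z) for sample x, using formulas (30) and (31)."""
    m, n = len(x), len(y)
    big_n = m + n
    ranks = midranks(list(x) + list(y))
    center = (big_n + 1) / 2
    m_stat = sum((r - center) ** 2 for r in ranks[:m])
    mean = m * (big_n ** 2 - 1) / 12
    sd = math.sqrt(m * n * (big_n + 1) * (big_n ** 2 - 4) / 180)
    return m_stat, (m_stat - mean) / sd

m_stat, z = mood_test(sample1, sample5)
print(m_stat, round(z, 5))
```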



The Savage Test for Positive Random Variables 

Unlike the Siegel-Tukey test and the Mood test, the Savage test does not assume that location 
remains the same. It is assumed that differences in scale cause a difference in location. The samples are 
assumed to be drawn from continuous distributions. 

The null hypothesis is that there is no difference in spread against the two-tailed alternative, there 
is a difference. 

Procedure. 

Let sample 1 be $x_1, x_2, \ldots, x_m$ and let sample 2 be $y_1, y_2, \ldots, y_n$. The combined samples are 
ordered, keeping track of sample membership. Let $R_i$ be the rank for $x_i$. The test statistic is computed for 
either sample. 

Test statistic. 

The test statistic is 

$$S = \sum_{i=1}^{m} a(R_i) \tag{32}$$

where 

$$a(i) = \sum_{j=N-i+1}^{N} \frac{1}{j} \tag{33}$$

such that $a(1) = \frac{1}{N}$, $a(2) = \frac{1}{N-1} + \frac{1}{N}$, ..., 
$a(N) = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N-1} + \frac{1}{N}$. 

Large sample sizes. 

For large sample sizes the following normal approximation may be used. 

$$S^* = \frac{S - m}{\sqrt{\frac{nm}{N - 1}\left(1 - \frac{1}{N}\sum_{j=1}^{N}\frac{1}{j}\right)}} \tag{34}$$

$S^*$ is compared to the critical z value from the standard normal distribution. 

Example. 

The Savage statistic was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), n = m = 
15. Using Sample 1, S = 18.3114. The large sample approximation is 

$$\frac{18.3114 - 15}{\sqrt{7.7586(.86683)}} = \frac{3.3114}{2.59334} = 1.27689$$

Because 1.27689 < 1.95996, the null hypothesis cannot be rejected. 
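The Savage scores and the normal approximation can be sketched in Python as follows. The function names are hypothetical. The text does not say how ties were handled for this test; breaking ties by listing Sample 1 before Sample 5 is an assumption, adopted here only because it reproduces S = 18.3114.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def savage_scores(big_n):
    """Return a[1..N] with a(i) = sum of 1/j for j = N-i+1, ..., N (a[0] is unused)."""
    a = [0.0] * (big_n + 1)
    for i in range(1, big_n + 1):
        a[i] = a[i - 1] + 1.0 / (big_n - i + 1)
    return a

def savage_statistic(x, y):
    """S for sample x; tied values are ordered with x first (an assumption)."""
    labeled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    a = savage_scores(len(labeled))
    return sum(a[rank] for rank, (_, label) in enumerate(labeled, start=1) if label == 0)

def savage_z(s, size, other_size):
    """Normal approximation for S computed on a sample of the given size."""
    big_n = size + other_size
    harmonic = sum(1.0 / j for j in range(1, big_n + 1))
    variance = size * other_size / (big_n - 1) * (1 - harmonic / big_n)
    return (s - size) / math.sqrt(variance)

s_stat = savage_statistic(sample1, sample5)
z = savage_z(s_stat, 15, 15)
print(round(s_stat, 4), round(z, 5))
```

Because the scores sum to N, the statistic for a sample has null mean equal to that sample's size, which is why the numerator subtracts the size of the sample on which S was computed.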



Ansari-Bradley Test 

This is a rank test for spread when the population medians are the same. The null hypothesis is 
that the two populations have the same spread against the two-tailed alternative that the spreads of the 
two populations differ. 

Procedure. 

Order the combined samples, keeping track of sample membership. Rank the smallest and largest 
observation 1. Rank the second lowest and second highest 2. If the combined sample size, N, is odd, the 
middle score will be ranked $\frac{N + 1}{2}$, and if N is even the middle two ranks will be $\frac{N}{2}$. The pattern will be 
either 1, 2, 3, ..., $\frac{N + 1}{2}$, ..., 3, 2, 1 (N odd), or 1, 2, 3, ..., $\frac{N}{2}$, $\frac{N}{2}$, ..., 3, 2, 1 (N even). 

Test statistic. 

The test statistic, W, is the sum of the ranks of sample 1. 

$$W = \sum_{i=1}^{m} R_i \tag{35}$$

where $R_i$ is the rank of the i-th observation of a sample. 

Large sample sizes. 

There are two formulae, one if N is even, and one if N is odd. 

$$W^* = \frac{W - \frac{m(m + n + 2)}{4}}{\sqrt{\frac{mn(m + n + 2)(m + n - 2)}{48(m + n - 1)}}} \tag{36}$$

if N is even and 

$$W^* = \frac{W - \frac{m(m + n + 1)^2}{4(m + n)}}{\sqrt{\frac{mn(m + n + 1)[3 + (m + n)^2]}{48(m + n)^2}}} \tag{37}$$

if N is odd. Reject the null hypothesis if $|W^*| > z_{\alpha/2}$. 



Example. 

The Ansari-Bradley statistic was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), 
n = m = 15. The combined samples were ranked using the method described, and tied ranks were 
assigned average ranks. The statistic, W, is 126.5, the rank sum of Sample 5. The large sample 
approximation is 

$$\frac{126.5 - 120}{\sqrt{144.827586}} = \frac{6.5}{12.03443} = 0.540117$$

Because 0.540117 < 1.95996, the null hypothesis cannot be rejected. 
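The symmetric ranking and formulas (36) and (37) can be sketched in Python as below. The function names are hypothetical; tied observations receive the average of the scores they occupy, as in the example.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix).
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def ansari_bradley_sums(x, y):
    """Return [W for x, W for y] using ranks 1, 2, 3, ..., 3, 2, 1 with ties averaged."""
    labeled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    total = len(labeled)
    scores = [min(i, total + 1 - i) for i in range(1, total + 1)]
    sums, i = [0.0, 0.0], 0
    while i < total:
        j = i
        while j + 1 < total and labeled[j + 1][0] == labeled[i][0]:
            j += 1
        avg = sum(scores[i:j + 1]) / (j - i + 1)
        for pos in range(i, j + 1):
            sums[labeled[pos][1]] += avg
        i = j + 1
    return sums

def ansari_bradley_z(w, m, n):
    """Standardize W for a sample of size m, using formula (36) or (37)."""
    big_n = m + n
    if big_n % 2 == 0:
        mean = m * (big_n + 2) / 4
        var = m * n * (big_n + 2) * (big_n - 2) / (48 * (big_n - 1))
    else:
        mean = m * (big_n + 1) ** 2 / (4 * big_n)
        var = m * n * (big_n + 1) * (3 + big_n ** 2) / (48 * big_n ** 2)
    return (w - mean) / math.sqrt(var)

w1, w5 = ansari_bradley_sums(sample1, sample5)
z = ansari_bradley_z(w5, 15, 15)
print(w1, w5, round(z, 6))
```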



Comparisons of Several Populations 

This section considered tests against an omnibus alternative and tests involving an ordered 
hypothesis. The omnibus tests were the Kruskal-Wallis test and Friedman’s test. The tests for ordered 
alternatives are the Terpstra-Jonckheere test, Page's test, and the match test. 

The Kruskal-Wallis test is a test for independent samples. It is analogous to the one-way analysis 
of variance. Friedman's test is an omnibus test for k related samples, and is analogous to a two-way 
analysis of variance. 

Comparisons of several populations with ordered alternative hypotheses are extensions of a one- 
sided test. Where an omnibus alternative states only that there is some difference between the 
populations, an ordered alternative specifies the order of differences. Three tests for an ordered 
alternative were included, the Terpstra-Jonckheere Test, Page’s Test, and the Match Test. 



Kruskal-Wallis Test 

In 1952, the Kruskal-Wallis test was derived from the F test. It is an extension of the Wilcoxon 
(Mann-Whitney) test. The null hypothesis is that the k populations have the same average (median). The 
alternative hypothesis is that at least one sample is from a distribution with a different average (median). 

Procedure. 

Rank all the observations in the combined samples, keeping track of the sample membership. 
Compute the rank sums of each sample. Let $R_i$ equal the sum of the ranks of the i-th sample of sample 
size $n_i$. The logic of the test is that the ranks should be randomly distributed among the k samples. 

Test statistic. 

The formula is 

$$H = \frac{12}{N(N + 1)}\sum_{i=1}^{k}\frac{R_i^2}{n_i} - 3(N + 1) \tag{38}$$

where N is the total sample size, $n_i$ is the size of the i-th group, k is the number of groups, and $R_i$ is the 
rank-sum of the i-th group. Reject $H_0$ when H > critical value. 

Large sample sizes. 

For large sample sizes, the null distribution is approximated by the $\chi^2$ distribution with k - 1 
degrees of freedom. Thus, the rejection rule is to reject $H_0$ if $H > \chi^2_{\alpha, k-1}$, where $\chi^2_{\alpha, k-1}$ is the value of 
$\chi^2$ at nominal α with k - 1 degrees of freedom. 

Example. 

The Kruskal-Wallis statistic was calculated using Samples 1 - 5 in Table 3 (Appendix), 
$n_1 = n_2 = n_3 = n_4 = n_5 = 15$. The combined samples were ranked, and tied ranks were assigned midranks. The rank 
sums were: $R_1 = 638$, $R_2 = 595$, $R_3 = 441.5$, $R_4 = 656.5$, and $R_5 = 519$. The sum of $R_i^2$ = 1,656,344.5, 
i = 1, 2, 3, 4, 5. 

$$H = \frac{12}{75 \cdot 76}\left(\frac{1{,}656{,}344.5}{15}\right) - 3 \cdot 76 = 0.00211(110{,}422.9667) - 228 = 4.4694$$

The statistic, H = 4.4694. The large sample approximation is chi-square with 5 - 1 = 4 degrees of freedom 
at α = .05, which is 9.488. Because 4.4694 < 9.488, the null hypothesis cannot be rejected. 
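Formula (38) can be evaluated from the rank sums with a few lines of Python. This is a sketch with a hypothetical function name; it starts from the rank sums reported in the example rather than re-ranking the raw data, and it takes the tabled critical value 9.488 as given.

```python
def kruskal_wallis_h(rank_sums, sizes):
    """H from formula (38), given each group's rank sum and sample size."""
    big_n = sum(sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, sizes))
    return 12 / (big_n * (big_n + 1)) * weighted - 3 * (big_n + 1)

rank_sums = [638, 595, 441.5, 656.5, 519]   # reported in the example
sizes = [15] * 5
h_stat = kruskal_wallis_h(rank_sums, sizes)
reject = h_stat > 9.488                     # chi-square, 4 df, alpha = .05
print(round(h_stat, 4), reject)
```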



Friedman’s Test 

In 1937, the Friedman test was developed as a test for k related samples. The null hypothesis is 
that the samples come from the same population against the alternative that at least one of the samples 
comes from a different population. The data are arranged in k columns and n rows, where each row 
contains k related observations. 

Procedure. 

Rank the observations for each row from 1 to k. For each of the k columns, the ranks are added 
and averaged, and the mean is designated $\bar{R}_j$. The overall mean of the ranks is $\bar{R} = \frac{1}{2}(k + 1)$. The sum 
of the squares of the deviations of the mean of the ranks of the columns from the overall mean rank is 
computed. The test statistic is a multiple of this sum. 

Test statistic. 

The test statistic for Friedman’s test is M, which is a multiple of S, as follows: 

$$S = \sum_{j=1}^{k}\left(\bar{R}_j - \bar{R}\right)^2 \tag{39}$$

$$M = \frac{12n}{k(k + 1)}S \tag{40}$$

where n is the number of rows, and k is the number of columns. An alternate formula that does not use S 
is as follows. 

$$M = \frac{12}{nk(k + 1)}\sum_{j=1}^{k}R_j^2 - 3n(k + 1) \tag{41}$$

where n is the number of rows, k is the number of columns, and $R_j$ is the rank sum for the j-th column, j = 
1, 2, 3, ..., k. 

Large sample sizes. 

For large sample sizes, the critical values can be approximated by the chi-square distribution 
with k - 1 degrees of freedom. 
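The row ranking and both forms of the statistic, formulas (40) and (41), can be sketched in Python as follows. The function names are hypothetical; the rows are the fifteen rows of Table 3 (Appendix), with Samples 1 - 5 as columns, which is the layout used in the worked example for this test.

```python
table3 = [  # one row per observation, columns are Samples 1-5
    [20, 11, 9, 34, 10], [33, 34, 14, 10, 2], [4, 23, 33, 38, 32],
    [34, 37, 5, 41, 4], [13, 11, 8, 4, 33], [6, 24, 14, 26, 19],
    [29, 5, 20, 10, 11], [17, 9, 18, 21, 21], [39, 11, 8, 13, 9],
    [26, 33, 22, 15, 31], [13, 32, 11, 35, 12], [9, 18, 33, 43, 20],
    [33, 27, 20, 13, 33], [16, 21, 7, 20, 15], [36, 8, 7, 13, 15],
]

def midranks(values):
    """Ranks 1..k within a row; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def friedman(rows):
    """Return (column rank sums, M from formula (41), M from formula (40))."""
    n, k = len(rows), len(rows[0])
    col_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(midranks(row)):
            col_sums[j] += r
    m_alt = 12 / (n * k * (k + 1)) * sum(r * r for r in col_sums) - 3 * n * (k + 1)
    s = sum((r / n - (k + 1) / 2) ** 2 for r in col_sums)
    m_from_s = 12 * n / (k * (k + 1)) * s
    return col_sums, m_alt, m_from_s

col_sums, m_alt, m_from_s = friedman(table3)
print(col_sums, round(m_alt, 4), round(m_from_s, 4))
```

Computing M both ways is a cheap consistency check: the two formulas are algebraically identical, so any disagreement points to a ranking or bookkeeping error.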






Example. 

Friedman’s statistic was calculated with Samples 1 - 5 in Table 3 (Appendix), 
$n_1 = n_2 = n_3 = n_4 = n_5 = 15$. The rows were ranked, with midranks assigned to tied ranks. The column sums are: 
$R_1 = 48.5$, $R_2 = 47$, $R_3 = 33$, $R_4 = 52.5$, and $R_5 = 44$. The sum of the squared rank sums is 10,342.5. 
The statistic is 

$$M = \frac{12}{15 \cdot 5 \cdot 6}(10{,}342.5) - 3 \cdot 15 \cdot 6 = 0.02667(10{,}342.5) - 270 = 5.8$$

The large sample approximation is chi-square with 5 - 1 = 4 degrees of freedom and α = .05, which is 
9.488. Because 5.8 < 9.488, the null hypothesis cannot be rejected. 

Terpstra-Jonckheere Test 

This is a test for more than two independent samples. It was first developed by Terpstra in 1952 
and later independently developed by Jonckheere in 1954. The null hypothesis is that the medians of the 
samples are equal against the alternative that the medians are either decreasing or increasing. This test is 
based on the Mann-Whitney U statistic, where U is calculated for each pair of samples and the U 
statistics are added. 

Suppose the null hypothesis is $H_0: m_1 = m_2 = \ldots = m_k$ and the alternative hypothesis is 
$H_1: m_1 < m_2 < \ldots < m_k$, where $m_i$ is the median for sample i, i = 1, 2, ..., k. The U statistic is 
calculated for each of the $\frac{k(k - 1)}{2}$ pairs, which are ordered so that the smallest U is calculated. 

Test statistic. 

The test statistic is the sum of the U statistics. 

$$W = \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} U_{ij} \tag{42}$$

where $U_{ij}$ is the number of $(x_i, x_j)$ pairs with $x_j$ less than $x_i$. 

Large sample sizes. 

The null distribution of W approaches normality as the sample size increases. The mean of the 
distribution is 

$$\mu = \frac{N^2 - \sum_{i=1}^{k} n_i^2}{4} \tag{43}$$

and the standard deviation is 

$$\sigma = \sqrt{\frac{N^2(2N + 3) - \sum_{i=1}^{k} n_i^2(2n_i + 3)}{72}} \tag{44}$$

where N is the total number of observations and $n_i$ is the size of sample i. 
The critical value for large samples is given by 

$$W \leq \mu - z\sigma - \frac{1}{2} \tag{45}$$

where z is the standard normal value, and $\frac{1}{2}$ is a continuity correction. 



Example. 

The Terpstra-Jonckheere statistic was calculated with Samples 1 - 5 in Table 3 (Appendix), 
$n_1 = n_2 = n_3 = n_4 = n_5 = 15$. This was done as a one-tailed test with α = .05. The U statistics for each pair of 
samples were calculated: $U_{2,1} = 121$, $U_{3,1} = 145$, $U_{4,1} = 103$, $U_{5,1} = 135$, $U_{3,2} = 142$, $U_{4,2} = 97$, 
$U_{5,2} = 124$, $U_{4,3} = 71$, $U_{5,3} = 91$, and $U_{5,4} = 136$, for a total of W = 1165. The large sample approximation 
was calculated, with μ = 1125 and σ = 106.94625. The approximation is 1125 - 1.6449(106.9463) - .5 = 948.584. 
Because 1165 > 948.584, the null hypothesis cannot be rejected. 
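The large sample critical value can be checked with a short Python sketch. The function name is hypothetical, and the mean and standard deviation are the standard Jonckheere-Terpstra null moments, which reproduce the values μ = 1125 and σ = 106.94625 quoted in the example. The pairwise U statistics are taken from the example rather than recomputed.

```python
import math

def tj_large_sample(sizes, z):
    """Return (mean, sd, critical value) for the sum of pairwise U statistics."""
    big_n = sum(sizes)
    mean = (big_n ** 2 - sum(s * s for s in sizes)) / 4
    var = (big_n ** 2 * (2 * big_n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72
    sd = math.sqrt(var)
    return mean, sd, mean - z * sd - 0.5     # lower-tail rule with continuity correction

u_stats = [121, 145, 103, 135, 142, 97, 124, 71, 91, 136]   # reported in the example
w_stat = sum(u_stats)
mean, sd, critical = tj_large_sample([15] * 5, 1.6449)
reject = w_stat <= critical
print(w_stat, mean, round(sd, 5), round(critical, 3), reject)
```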



Page’s Test 

In 1963, Page’s test for an ordered hypothesis for k > 2 related samples was developed. It takes 
the form of a randomized block design, with k columns and n rows. The null hypothesis is 
$H_0: m_1 = m_2 = \ldots = m_k$ and the alternative hypothesis is $H_1: m_1 < m_2 < \ldots < m_k$ for i = 1, 2, ..., k. For 
this test, the alternative must be of this form. The samples need to be reordered if necessary. 

Procedure. 

The data are ranked from 1 to k for each row, creating a table of the ranks. The ranks of each of 
the k columns are totaled. If the null hypothesis is true, the ranks should be evenly distributed over the 
columns, whereas if the alternative is true, the rank sums should increase with the column index. 

Test statistic. 

Each column rank-sum is multiplied by the column index. The test statistic is 

$$L = \sum_{i=1}^{k} iR_i \tag{46}$$

where i is the column index, i = 1, 2, 3, ..., k, and $R_i$ is the rank sum for the i-th column. 

Large sample sizes. 

The mean of L is 

$$\mu = \frac{nk(k + 1)^2}{4} \tag{47}$$

and the standard deviation is 

$$\sigma = \sqrt{\frac{nk^2(k + 1)(k^2 - 1)}{144}} \tag{48}$$

For a given α, the approximate critical region is 

$$L \geq \mu + z\sigma + \frac{1}{2} \tag{49}$$

Example. 


Page’s statistic was calculated with Samples 1 - 5 in Table 3 (Appendix), $n_1 = n_2 = n_3 = n_4 = n_5 = 15$. 
This was done as a one-tailed test with α = .05. The rows are ranked with midranks assigned to tied 
ranks. The column sums are: $R_1 = 48.5$, $R_2 = 47$, $R_3 = 33$, $R_4 = 52.5$, and $R_5 = 44$. The statistic, L, is the 
sum of $iR_i$ = 671.5, where i = 1, 2, 3, 4, 5. The large sample approximation was calculated with μ = 675 
and σ = 19.3649. The approximation is 675 + 1.64485(19.3649) + .5 = 707.352. Because 671.5 < 
707.352, the null hypothesis cannot be rejected. 
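Page's statistic and its large sample critical value follow directly from the column rank sums. The Python sketch below uses a hypothetical function name and starts from the column sums reported in the example; the mean and standard deviation are the values that reproduce μ = 675 and σ = 19.3649.

```python
import math

def page_test(col_sums, n_rows, z):
    """Return (L, mean, sd, critical value) for ordered column rank sums."""
    k = len(col_sums)
    l_stat = sum(i * r for i, r in enumerate(col_sums, start=1))
    mean = n_rows * k * (k + 1) ** 2 / 4
    sd = math.sqrt(n_rows * k ** 2 * (k + 1) * (k ** 2 - 1) / 144)
    return l_stat, mean, sd, mean + z * sd + 0.5

col_sums = [48.5, 47, 33, 52.5, 44]          # reported in the example
l_stat, mean, sd, critical = page_test(col_sums, 15, 1.64485)
reject = l_stat >= critical
print(l_stat, mean, round(sd, 4), round(critical, 3), reject)
```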



The Match Test for Ordered Alternatives 

The match test is a test for k > 2 related samples with an ordered alternative hypothesis. The 
match test was developed by Neave and Worthington (1988). It is very similar in concept to Page’s test, 
but instead of using rank-sums, it uses the number of matches of the ranks with the expected ranks plus 
half the near matches. 

The hypotheses are the same as for Page’s test. The null hypothesis is $H_0: m_1 = m_2 = \ldots = m_k$ and 
the alternative hypothesis is $H_1: m_1 < m_2 < \ldots < m_k$ for i = 1, 2, ..., k. 

Procedure. 

A table of ranks is compiled with the observations in each row ranked from 1 to k. Ties are 
assigned average ranks. Each rank, $r_i$, is compared with the expected rank, i, which is the column index. 
If the rank equals the column index, it is a match. The number of matches is counted. Every non-match 
such that $0.5 \leq |r_i - i| \leq 1.5$ is counted as a near match. 

Test statistic. 

The test statistic is 

$$L_2 = L_1 + \frac{1}{2}(\text{number of near matches}) \tag{50}$$

where $L_1$ is the number of matches. 

Large sample sizes. 

The null distribution approaches a normal distribution for large sample sizes. The mean and 
standard deviation for $L_2$ are as follows: 

$$\mu = n\left(2 - \frac{1}{k}\right) \tag{51}$$

$$\sigma = \sqrt{n\left(\frac{3(k - 2)}{2k}\right) + \frac{1}{k(k - 1)}} \tag{52}$$

For a given level of significance α the critical value approximation is 

$$L_2 \geq \mu + z\sigma + \frac{1}{2} \tag{53}$$

where z is the upper-tail critical value from the standard normal distribution and $\frac{1}{2}$ is a continuity 
correction. 

Example. 

The Match statistic was calculated with Samples 1 - 5 in Table 3 (Appendix), 
$n_1 = n_2 = n_3 = n_4 = n_5 = 15$. This was done as a one-tailed test with α = .05. The rows are ranked with midranks 
assigned to tied ranks. The numbers of matches for the five columns are 3, 3, 2, 2, and 1, for $L_1 = 11$. The 
numbers of near matches were 1, 6, 8, 8, and 4, for a total of 27. The statistic is $L_2 = 11 + .5(27) = 24.5$. 
For the large sample approximation, μ = 27 and σ = 3.68103. The approximation is 
27 + 1.6449(3.68103) + .5 = 33.5549. Because 24.5 < 33.5549, the null hypothesis cannot be rejected. 
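Counting matches and near matches by hand is error-prone, so a Python sketch is given here. The function names are hypothetical; the rows are those of Table 3 (Appendix), and a near match is taken to include the boundary cases |r - i| = 0.5 and 1.5, an assumption that reproduces the counts in the example. Only the mean n(2 - 1/k) is computed for the large sample approximation.

```python
table3 = [  # one row per observation, columns are Samples 1-5
    [20, 11, 9, 34, 10], [33, 34, 14, 10, 2], [4, 23, 33, 38, 32],
    [34, 37, 5, 41, 4], [13, 11, 8, 4, 33], [6, 24, 14, 26, 19],
    [29, 5, 20, 10, 11], [17, 9, 18, 21, 21], [39, 11, 8, 13, 9],
    [26, 33, 22, 15, 31], [13, 32, 11, 35, 12], [9, 18, 33, 43, 20],
    [33, 27, 20, 13, 33], [16, 21, 7, 20, 15], [36, 8, 7, 13, 15],
]

def midranks(values):
    """Ranks 1..k within a row; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def match_statistic(rows):
    """Return (matches, near matches, L2) over all rows."""
    matches = near = 0
    for row in rows:
        for col, r in enumerate(midranks(row), start=1):
            gap = abs(r - col)
            if gap == 0:
                matches += 1
            elif 0.5 <= gap <= 1.5:
                near += 1
    return matches, near, matches + 0.5 * near

matches, near, l2 = match_statistic(table3)
mean = len(table3) * (2 - 1 / len(table3[0]))
print(matches, near, l2, mean)
```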

Rank Correlation Tests 

Rank correlation is a measure of the association of a pair of variables. Two tests of 
association were studied, Spearman’s rank correlation coefficient (rho) and Kendall’s rank correlation 
coefficient (tau). 



Spearman’s Rank Correlation Coefficient 

Spearman’s rank correlation (rho) was published in 1904. Let X and Y be the two variables of 
interest. Each observed pair is denoted by $(x_i, y_i)$. The paired ranks are denoted by $(r_i, s_i)$, where $r_i$ is the 
rank of $x_i$ and $s_i$ is the rank of $y_i$. The null hypothesis for a two-tailed test is $H_0: \rho = 0$ against the 
alternative, $H_1: \rho \neq 0$. The alternative hypotheses for a one-tailed test are $H_1: \rho > 0$ or $H_1: \rho < 0$. 

Procedure. 

Rank both X and Y scores while keeping track of the original pairs. Form the rank pairs $(r_i, s_i)$ 
which correspond to the original pair, $(x_i, y_i)$. Calculate the sum of the squared differences between $r_i$ 
and $s_i$. 

Test statistic. 

If there are no ties, the formula is 

$$\rho = 1 - \frac{6T}{n(n^2 - 1)} \tag{54}$$

where 

$$T = \sum_{i=1}^{n}(r_i - s_i)^2 \tag{55}$$

Large sample sizes. 

For large n the distribution of ρ is approximately normal. The critical values can be found by 
$z = \rho\sqrt{n - 1}$. The rejection rule for a two-tailed test is to reject $H_0$ if $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$, where $z_{\alpha/2}$ is the 
critical value for the given level of significance. 

Example. 

Spearman’s rho was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), n = 15. The 
sum of the squared rank differences for the two samples is T = 839. Rho is 
$1 - \frac{6(839)}{15(224)} = 1 - \frac{5034}{3360} = 1 - 1.498214 = -0.498214$. The large sample approximation is 
$z = -0.498214\sqrt{14} = -1.864147$. Because -1.864 > -1.960, the null hypothesis cannot be rejected. 
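Formulas (54) and (55) and the normal approximation can be reproduced with the Python sketch below. The function names are hypothetical. Midranks are used within each sample, and formula (54) is applied even though the data contain ties, which is what the example does.

```python
import math

# Sample 1 and Sample 5 from Table 3 (Appendix), paired by position.
sample1 = [20, 33, 4, 34, 13, 6, 29, 17, 39, 26, 13, 9, 33, 16, 36]
sample5 = [10, 2, 32, 4, 33, 19, 11, 21, 9, 31, 12, 20, 33, 15, 15]

def midranks(values):
    """Ranks 1..n; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_large_sample(x, y):
    """Return (T, rho, z) using formulas (54) and (55) and z = rho * sqrt(n - 1)."""
    n = len(x)
    t = sum((r - s) ** 2 for r, s in zip(midranks(x), midranks(y)))
    rho = 1 - 6 * t / (n * (n * n - 1))
    return t, rho, rho * math.sqrt(n - 1)

t_stat, rho, z = spearman_large_sample(sample1, sample5)
print(t_stat, round(rho, 6), round(z, 6))
```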



Kendall’s Rank Correlation Coefficient 

Kendall’s rank correlation coefficient (tau) is similar to Spearman’s rho. The underlying concept 
is the tendency for concordance. Concordance is the concept that if $x_i > x_j$ then $y_i > y_j$. Concordance 
implies that the differences $x_i - x_j$ and $y_i - y_j$ have the same sign, either “+” or “-”. Discordant pairs are 
pairs that have opposite signs, that is, $x_i > x_j$ but $y_i < y_j$, or the opposite, $x_i < x_j$ but $y_i > y_j$. A high 
number of concordant pairs supports the alternative hypothesis of positive correlation, and a high 
number of discordant pairs supports an alternative hypothesis of negative correlation. 

Procedure. 

Arrange the pairs in ascending order of X. Count the number of $y_j$ which are smaller than $y_1$. This 
is the number of discordant pairs ($N_D$) for $x_1$. Repeat the process for each subsequent $x_i$, counting the 
number of smaller $y_j$ to the right of $y_i$, j = i + 1, i + 2, i + 3, ..., n. 

Test statistic. 

Because the total number of pairs is $\frac{1}{2}n(n - 1)$, it follows that $N_C = \frac{1}{2}n(n - 1) - N_D$. The statistic 
(τ) is defined as 

$$\tau = \frac{N_C - N_D}{\frac{1}{2}n(n - 1)} \tag{56}$$

This formula can be simplified by substituting $N_C = \frac{1}{2}n(n - 1) - N_D$ into the formula so that 

$$\tau = 1 - \frac{4N_D}{n(n - 1)} \tag{57}$$

From this formula, it can be seen that if there are no discordant pairs, τ equals 1, showing positive 
correlation. If all pairs are discordant, $4N_D = 4(\frac{1}{2})n(n - 1) = 2n(n - 1)$ and τ = 1 - 2 = -1, showing 
negative correlation. 

Large sample sizes. 

For large sample sizes, the formula is 

$$z = \frac{3\tau\sqrt{n(n - 1)}}{\sqrt{2(2n + 5)}} \tag{58}$$

where z is compared to the z score from the standard normal distribution for the appropriate alpha level. 
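
Formulas (57) and (58) can be sketched in Python as follows. The function names are hypothetical. The counting helper assumes untied data and is demonstrated on a small made-up set of pairs; the value $N_D$ = 68 is taken from the worked example for this test rather than recomputed, because the text does not state how tied observations were counted.

```python
import math

def count_discordant(x, y):
    """Number of pairs ordered one way on x and the other way on y (no ties assumed)."""
    pairs = sorted(zip(x, y))
    count = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if pairs[j][1] < pairs[i][1]:
                count += 1
    return count

def kendall_large_sample(n_discordant, n):
    """Return (tau, z) from formulas (57) and (58)."""
    tau = 1 - 4 * n_discordant / (n * (n - 1))
    z = 3 * tau * math.sqrt(n * (n - 1)) / math.sqrt(2 * (2 * n + 5))
    return tau, z

toy_count = count_discordant([1, 2, 3], [3, 1, 2])   # illustrative data only
tau, z = kendall_large_sample(68, 15)                # N_D reported in the example
print(toy_count, round(tau, 6), round(z, 6))
```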

Example. 

Kendall’s tau was calculated using Sample 1 and Sample 5 in Table 3 (Appendix), n = 15. The 
numbers of discordant pairs for each pair, $(x_1, x_5)$, of Sample 1 and Sample 5 values were 12, 8, 8, 5, 9, 5, 
6, 3, 5, 3, 0, 3, 0, 1, and 0. The total number of discordant pairs, $N_D$, is 68. Tau is 
$1 - \frac{4(68)}{15 \cdot 14} = 1 - \frac{272}{210} = -0.295238$. The large sample approximation is 

$$z = \frac{3(-.295238)\sqrt{(15)(14)}}{\sqrt{2(35)}} = \frac{-12.83522}{8.3666} = -1.534102$$

Because -1.534102 > -1.95996, the null hypothesis cannot be rejected. 

Appendix 



Table 3. Samples Randomly Selected from Micceri’s Multimodal Lumpy Data Set. 

Sample 1   Sample 2   Sample 3   Sample 4   Sample 5 

   20         11          9         34         10 
   33         34         14         10          2 
    4         23         33         38         32 
   34         37          5         41          4 
   13         11          8          4         33 
    6         24         14         26         19 
   29          5         20         10         11 
   17          9         18         21         21 
   39         11          8         13          9 
   26         33         22         15         31 
   13         32         11         35         12 
    9         18         33         43         20 
   33         27         20         13         33 
   16         21          7         20         15 
   36          8          7         13         15 


Table 4. Multimodal Lumpy Set (Micceri, 1989). 

Score   Cumulative   cdf        Score   Cumulative   cdf 
        Frequency                       Frequency 

  0          5       0.01071      22       269       0.57602 
  1         13       0.02784      23       279       0.59743 
  2         21       0.04497      24       282       0.60385 
  3         24       0.05139      25       287       0.61456 
  4         32       0.06852      26       297       0.63597 
  5         38       0.08137      27       306       0.65525 
  6         41       0.08779      28       309       0.66167 
  7         50       0.10707      29       319       0.68308 
  8         62       0.13276      30       325       0.69593 
  9         80       0.17131      31       336       0.71949 
 10         91       0.19486      32       351       0.75161 
 11        114       0.24411      33       364       0.77944 
 12        136       0.29122      34       379       0.81156 
 13        160       0.34261      35       389       0.83298 
 14        180       0.38544      36       401       0.85867 
 15        195       0.41756      37       418       0.89507 
 16        213       0.45610      38       428       0.91649 
 17        225       0.48180      39       434       0.92934 
 18        234       0.50107      40       445       0.95289 
 19        244       0.52248      41       454       0.97216 
 20        254       0.54390      42       460       0.98501 
 21        261       0.55889      43       467       1.00000 